Classification of Handwritten Digits

Anna Androvitsanea

aandrovitsanea@aueb.gr

In this project I present a program that classifies handwritten digits.

The program is written in Python and consists of the following steps.

Introduction

Import libraries

Import data

Exploratory data analysis

Task 1

Step 1: Calculate SVD

For each digit I will calculate the SVD, based on the training data.

Specifically, the singular value decomposition of the $m \times n$ complex matrix $\mathbf{M}$ will be calculated as a factorization of the form $$\mathbf{M} = \mathbf{U \Sigma V^{*}}$$
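A minimal sketch of this step, assuming NumPy and a training array of shape `(n_samples, n_pixels)` with one label per row. The name `digit_bases` and the random demo data are illustrative assumptions, not the data or code used in this project.

```python
import numpy as np

def digit_bases(X_train, y_train):
    """Return {digit: (U, s)} with one SVD per digit class."""
    bases = {}
    for digit in np.unique(y_train):
        # Assumed layout: rows of X_train are flattened images.
        # Transpose so that each column of A is one image of this digit.
        A = X_train[y_train == digit].T
        # Thin SVD: A = U @ diag(s) @ Vt, singular values sorted descending.
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        bases[int(digit)] = (U, s)
    return bases

# Random stand-in data: 3 classes, 20 images each, 16 pixels per image.
rng = np.random.default_rng(0)
X_demo = rng.random((60, 16))
y_demo = np.repeat(np.arange(3), 20)
bases = digit_bases(X_demo, y_demo)
```

The columns of each `U` are orthonormal, so the first few columns can serve as a basis for the corresponding digit.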

Step 2: Tuning

For each number of basis vectors n in the range 5 to 20, I extract the first n columns of the **U** matrix of the calculated SVD.

I will then calculate the residual $\|(I - U_k U_k^T) z\|$ for each digit, where $U_k$ holds the first $k = n$ columns of **U** and $z$ is a test image.

For each test image, the digit with the smallest residual is taken as the digit that the image represents.
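The residual rule can be sketched as follows. This is an illustration under assumptions: `residual` and `classify` are hypothetical helper names, and the two toy bases in $\mathbb{R}^4$ stand in for the per-digit **U** matrices.

```python
import numpy as np

def residual(U, z, k):
    """Compute ||(I - U_k U_k^T) z|| without forming the full projector."""
    Uk = U[:, :k]
    return np.linalg.norm(z - Uk @ (Uk.T @ z))

def classify(bases, z, k):
    """Return the class whose first k basis vectors give the smallest residual."""
    residuals = {d: residual(U, z, k) for d, U in bases.items()}
    return min(residuals, key=residuals.get)

# Toy example: class 0 spans the first two axes, class 1 the last two.
toy_bases = {0: np.eye(4)[:, :2], 1: np.eye(4)[:, 2:]}
z = np.array([1.0, 2.0, 0.0, 0.1])
predicted = classify(toy_bases, z, k=2)
```

Computing `Uk @ (Uk.T @ z)` avoids building the square matrix $U_k U_k^T$, which matters when the images have many pixels.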

Step 3: Scores

Task 2

Best results

For the number of basis vectors that provides the best accuracy, namely n = 18, I will compute the relevant scores per digit.
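One possible per-digit score is the fraction of images of each digit that are classified correctly (per-class recall). The sketch below assumes that choice; the function name and the small label lists are made up for illustration and are not results from this project.

```python
import numpy as np

def per_digit_accuracy(y_true, y_pred):
    """Fraction of correctly classified images for each true digit."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return {
        int(d): float(np.mean(y_pred[y_true == d] == d))
        for d in np.unique(y_true)
    }

# Made-up labels: two of the four 8s are misread as other digits.
y_true_demo = [0, 0, 1, 1, 8, 8, 8, 8]
y_pred_demo = [0, 0, 1, 1, 8, 3, 8, 5]
scores = per_digit_accuracy(y_true_demo, y_pred_demo)
```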

Worst digits

I notice that digits **8**, **3**, **5** and **9** have the worst accuracy, meaning that they have been the most difficult to identify.

Let's take a closer look at their plots.

Digit 8

Looking at the plots I notice that the quality of the image is often poor, which makes the digit difficult to classify.

Digit 3

Looking at the plots I notice that the quality of the image is often poor, which makes the digit difficult to classify.

Digit 5

Looking at the plots I notice that the quality of the image is often poor, which makes the digit difficult to classify.

Digit 9

Looking at the plots I notice that the quality of the image is often poor, which makes the digit difficult to classify.

But let's also check digits **0** and **1**, which have the highest classification scores.

Best digits

Digit 0

Looking at the plots I notice that the quality of the image is excellent, which makes digit 0 easy to classify.

This is probably because 0 is a simple figure, consisting of a single circle. One would have to try hard to write an illegible 0.

Digit 1

Looking at the plots I notice that the quality of the image is excellent, which makes digit 1 easy to classify.

This is probably because 1 is a simple figure with no curves. One would have to try hard to write an illegible 1.

Task 3

I will check the singular values for the different digits. These are the diagonal entries $\sigma_i = \Sigma_{ii}$ of $\mathbf{\Sigma}$.
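As a small illustration of what the singular values are, the sketch below uses a random matrix as a stand-in for one digit's training matrix (an assumption for demonstration only). It rebuilds $\mathbf{\Sigma}$ from the vector returned by `np.linalg.svd` and reads off the largest singular value.

```python
import numpy as np

# Random stand-in for a digit's training matrix (pixels x images).
rng = np.random.default_rng(0)
A = rng.random((16, 20))

# s holds the diagonal entries sigma_i of Sigma, sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Sigma = np.diag(s)

# The largest singular value is the first entry; it equals the spectral norm of A.
largest = s[0]
reconstruction_ok = np.allclose(U @ Sigma @ Vt, A)
```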

Singular values

Max singular values per class

I notice that digits 0 and 1, which are the easiest digits to classify, have the highest singular values, while digit 5, which is the hardest to classify, has the lowest.

I will check the distribution of the singular values.

Plots

I notice that the distribution resembles Zipf's law: the few large values are concentrated in the first 50 components.

Therefore I will plot only the first 40 basis vectors and check the graph again.

I notice that in most cases the curve drops at around 3 components and becomes almost horizontal around 20 components.

I plot the first 10 basis vectors and check the graph again.

I quantify this behaviour by taking only the first 1 to 15 basis vectors of the SVD and computing the residuals.

I extract the first n columns of the **U** matrix of the calculated SVD. I then calculate the residuals based on the relation $||(I - U_kU_k^T)z ||$ for each digit.

Residuals
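A sketch of how the residual can be tracked as the number of basis vectors grows from 1 to 15. The helper `residual_curve` and the random matrices are assumptions for illustration, so the numbers it produces do not correspond to the plot discussed in this section.

```python
import numpy as np

def residual_curve(U, Z, ks):
    """Mean of ||(I - U_k U_k^T) z|| over the columns z of Z, for each k in ks."""
    curve = []
    for k in ks:
        Uk = U[:, :k]
        R = Z - Uk @ (Uk.T @ Z)            # residual vectors, one per column
        curve.append(np.linalg.norm(R, axis=0).mean())
    return np.array(curve)

# Random stand-ins: 40 training images and 10 test images with 64 pixels each.
rng = np.random.default_rng(0)
A_train = rng.random((64, 40))
Z_test = rng.random((64, 10))
U, _, _ = np.linalg.svd(A_train, full_matrices=False)

ks = range(1, 16)
curve = residual_curve(U, Z_test, ks)
```

Because each larger basis contains the smaller ones, the residual can only stay the same or decrease as k grows; how quickly it drops is what differs between digits.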

Plot

All digits have high residuals when only 1 to 3 basis vectors are used. There is some variation, though: for example, the residual of digit 1 drops to $< 2$ after only 4 basis components, whereas the residual of digit 6 remains high even when 15 basis components are used.

Therefore, we could consider varying the number of basis vectors per digit when modelling the classification of handwritten digits.

Accuracy

I check how many digits are correctly classified when using only one basis vector. I set a threshold of 70% for the minimum residuals per image and compare. If their ratio is $< 0.7$, I regard the classification as successful; otherwise I proceed using the first 10 basis vectors.

In conclusion, the results are satisfactory when using the same number of basis vectors for all digits.
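The two-stage rule could look like the sketch below. It rests on an assumption that the text does not spell out: the ratio is taken between the smallest and the second smallest residual obtained with one basis vector. The function names and toy bases are hypothetical, and the demo uses a fallback of 2 basis vectors only because the toy space has 4 dimensions.

```python
import numpy as np

def residuals_per_class(bases, z, k):
    """Residual ||(I - U_k U_k^T) z|| for every class."""
    out = {}
    for d, U in bases.items():
        Uk = U[:, :k]
        out[d] = np.linalg.norm(z - Uk @ (Uk.T @ z))
    return out

def two_stage_classify(bases, z, threshold=0.7, k_fallback=10):
    """Return (predicted class, number of basis vectors used)."""
    r = residuals_per_class(bases, z, 1)
    best, second = sorted(r.values())[:2]
    # Assumption: accept the one-vector result when the best class
    # is clearly ahead of the runner-up.
    if second > 0 and best / second < threshold:
        return min(r, key=r.get), 1
    r = residuals_per_class(bases, z, k_fallback)
    return min(r, key=r.get), k_fallback

# Toy bases: class 0 spans axes 0 and 1, class 1 spans axes 2 and 3.
toy_bases = {0: np.eye(4)[:, :2], 1: np.eye(4)[:, 2:]}
clear = two_stage_classify(toy_bases, np.array([1.0, 0.0, 0.1, 0.0]), k_fallback=2)
ambiguous = two_stage_classify(toy_bases, np.array([0.0, 1.0, 0.0, 0.2]), k_fallback=2)
```

In the toy example the first image is accepted after one basis vector, while the second is ambiguous at that stage and is resolved only with the larger basis.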